Bayesian Robust Optimization for Imitation Learning
One of the main challenges in imitation learning is determining what action an agent should take when outside the state distribution of the demonstrations. Inverse reinforcement learning (IRL) can enable generalization to new states by learning a parameterized reward function, but these approaches still face uncertainty over the true reward function and the corresponding optimal policy. Existing safe imitation learning approaches based on IRL deal with this uncertainty using a maxmin framework that optimizes a policy under the assumption of an adversarial reward function, whereas risk-neutral IRL approaches optimize a policy for either the mean or the MAP reward function. While completely ignoring risk can lead to overly aggressive and unsafe policies, optimizing in a fully adversarial sense is also problematic, as it can lead to overly conservative policies that perform poorly in practice. To provide a bridge between these two extremes, we propose Bayesian Robust Optimization for Imitation Learning (BROIL). BROIL leverages Bayesian reward function inference and a user-specific risk tolerance to efficiently optimize a robust policy that balances expected return and conditional value at risk. Our empirical results show that BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors and outperforms existing risk-sensitive and risk-neutral inverse reinforcement learning algorithms.
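The trade-off between expected return and conditional value at risk (CVaR) described above can be illustrated for a fixed policy. The following is a minimal sketch, not the authors' code: it assumes the policy's return under each posterior reward sample is already available as a vector, uses the standard Rockafellar-Uryasev form CVaR_alpha = max_sigma (sigma - E[(sigma - rho)_+] / (1 - alpha)), and blends the two terms with a weight `lam`. The function names and the `lam` parameterization are illustrative assumptions.

```python
import numpy as np

def cvar_lower_tail(returns, probs, alpha):
    """CVaR_alpha of a discrete return distribution, i.e. the mean of the
    worst (1 - alpha) fraction of outcomes (assumes 0 <= alpha < 1)."""
    returns = np.asarray(returns, dtype=float)
    probs = np.asarray(probs, dtype=float)
    best = -np.inf
    # The objective is concave and piecewise linear in sigma, with kinks at
    # the observed returns, so the maximum is attained at one of them.
    for sigma in returns:
        shortfall = np.maximum(sigma - returns, 0.0)
        value = sigma - float(probs @ shortfall) / (1.0 - alpha)
        best = max(best, value)
    return best

def blended_objective(returns, probs, alpha, lam):
    """lam * expected return + (1 - lam) * CVaR_alpha (illustrative blend)."""
    expected = float(np.dot(probs, returns))
    return lam * expected + (1.0 - lam) * cvar_lower_tail(returns, probs, alpha)

# Hypothetical returns of one policy under four equally likely reward samples.
rho = [0.0, 1.0, 2.0, 3.0]
p = [0.25, 0.25, 0.25, 0.25]
risk_neutral = blended_objective(rho, p, alpha=0.5, lam=1.0)  # expected return only
risk_averse = blended_objective(rho, p, alpha=0.5, lam=0.0)   # CVaR only
```

Setting `lam` to 1 recovers the risk-neutral objective, while `lam` equal to 0 scores the policy only by its worst-case tail; intermediate values interpolate between the two.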
Supplementary Materials for Bayesian Robust Optimization for Imitation Learning Daniel S. Brown
When using the robust performance metric described in Section 4.2, we have [...] We solve the above linear program to obtain the results presented in Section 5.1. We use SciPy's linear programming software (v1.4.1). [...] MDP is solved to obtain the sample's likelihood and determine the transition probabilities within the Markov chain. We used a learning rate of 0.01. (Author footnote: Work done while at UT Austin.)
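To give a feel for how such a linear program can be posed with SciPy, here is a hedged toy sketch. It is not the paper's formulation: instead of optimizing over state-action occupancies, it assumes a small set of candidate policies whose returns under each sampled reward are given in a matrix `M`, and it optimizes a mixture over those candidates. The function name `robust_mixture`, the matrix `M`, and the weight `lam` are all hypothetical. The CVaR term uses the standard Rockafellar-Uryasev linearization with auxiliary variables `sigma` and `z`.

```python
import numpy as np
from scipy.optimize import linprog

def robust_mixture(M, p, alpha, lam):
    """Maximize lam * E[return] + (1 - lam) * CVaR_alpha[return] over mixtures.

    M[i, k] is the return of candidate policy k under sampled reward i
    (hypothetical inputs); p[i] is the probability of sample i.
    """
    M = np.asarray(M, dtype=float)
    p = np.asarray(p, dtype=float)
    n, k = M.shape
    # Decision vector x = [w_1..w_k, sigma, z_1..z_n].
    # linprog minimizes, so the objective to maximize is negated.
    c = np.concatenate([
        -lam * (p @ M),                  # expected-return term
        [-(1.0 - lam)],                  # sigma term of CVaR
        (1.0 - lam) * p / (1.0 - alpha)  # expected shortfall below sigma
    ])
    # z_i >= sigma - (M w)_i  is written as  -(M w)_i + sigma - z_i <= 0.
    A_ub = np.hstack([-M, np.ones((n, 1)), -np.eye(n)])
    b_ub = np.zeros(n)
    # Mixture weights must sum to one.
    A_eq = np.concatenate([np.ones(k), [0.0], np.zeros(n)]).reshape(1, -1)
    b_eq = [1.0]
    # w >= 0, sigma is free, z >= 0.
    bounds = [(0, None)] * k + [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:k], -res.fun

# Two reward samples (rows), two candidate policies (columns):
# an aggressive policy with returns (10, -2) and a safe one with returns (1, 1).
M = [[10.0, 1.0], [-2.0, 1.0]]
p = [0.5, 0.5]
w_averse, v_averse = robust_mixture(M, p, alpha=0.5, lam=0.0)
w_neutral, v_neutral = robust_mixture(M, p, alpha=0.5, lam=1.0)
```

In this toy instance the risk-neutral setting puts all weight on the aggressive policy, while the pure-CVaR setting selects the safe one. The call relies on SciPy's default solver, so no assumption is made about which `method` options a given SciPy version supports.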
Review for NeurIPS paper: Bayesian Robust Optimization for Imitation Learning
Clarity: Overall, I think the paper is fairly well written. I understand that the authors are working within the page restrictions of the conference. With that said, I think there is substantial room for improvement in the presentation. First, I think there are more specific ways to describe the contributions (copied from summary): 1) a linear programming formulation to compute the optimal policy for CVaR; 2) a demonstration of how to use this formulation to implement robust policy optimization under a prior and robust imitation learning; 3) favorable comparisons with existing risk-sensitive and risk-neutral algorithms in both settings. Right now I think that the description of the contributions hides the most useful contribution.